Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP, RFC] doc/memos: Added RDM on high level timer API requirements and common features #12970

Closed
wants to merge 22 commits into from

Conversation

maribu
Copy link
Member

@maribu maribu commented Dec 16, 2019

Contribution description

Early draft for an RDM on a high level timer API.

Testing procedure

n/a

Issues/PRs references

None

@maribu maribu added State: WIP State: The PR is still work-in-progress and its code is not in its final presentable form yet Discussion: RFC The issue/PR is used as a discussion starting point about the item of the issue/PR Area: timers Area: timer subsystems Area: RDM Area: Discussions on RIOT Developer Memos labels Dec 16, 2019
@maribu
Copy link
Member Author

maribu commented Dec 16, 2019

I think I now requested reviews from every maintainer that contributed to either xtimer or ztimer and every reviewer of ztimer.

@miri64
Copy link
Member

miri64 commented Dec 17, 2019

If it is a draft, please keep to the naming scheme for drafts:

Initially, a proposed RDM is identified by its temporary file name of shape rdm-draft--.md (for example: rdm-draft-baccelli-rdm-format.md) instead of a number assigned at a later stage. At this stage, the RDM's header indactes status "draft".

@maribu
Copy link
Member Author

maribu commented Dec 17, 2019

If it is a draft, please keep to the naming scheme for drafts:

Done

- The implementation must only disable IRQs for short periods of time
- The implementation must cause as little IRQs as possible and its ISRs must
be as short as possible
- The implementation should provide means to execute callbacks in thread
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is IMO out of scope

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or should maybe go to the utility functionality...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or should maybe go to the utility functionality

Yep. It is already in the feature requirements below utilities. To me, this is an important thing to improve real time capabilities in a convenient way. (Obviously, one could just write their own handlers to call event_post(); but I think having a utility function for that can be much easier to use.)

I'm not sure if it makes sense to have this written twice. To me this is both a feature requirement and a realtime requirement; so I just added it twice.


## Precision Requirements

- Timer callbacks should never fire prematurely (+- 1 tick of the underlying
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important point. Is "-1" acceptable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With that formulation, clearly yes. I think the effort needed to prevent this could be quite significant. So if no one cares about it triggering early by 1 tick enough to speak up, we should just allow this :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the effort needed to prevent this could be quite significant.

set(val) { val += 1; ... }, or run_callback() { while(now() < target_time) {}; ... can achieve this.

We had a discussion some time ago, where people convinced me that "at least as" semantics are the only sane ones. Don't care about one tick? Set the timeout one less. (Having written that, I realize it also works the other way. :) )

The impact of this is very different depending on us, ms, or even second ticks are used.

I'm OK with keeping -1 for now.

@miri64 miri64 changed the title [WIP, RFC] doc/memos: Added RDM0002 on high level timer API [WIP, RFC] doc/memos: Added RDM on high level timer API Dec 17, 2019
@maribu
Copy link
Member Author

maribu commented Sep 21, 2020

I updated the title to contain both the survey and the requirements analysis and also added every contributor as author (sorted by last name, alphabetically).

@maribu maribu changed the title [WIP, RFC] doc/memos: Added RDM on high level timer API [WIP, RFC] doc/memos: Added RDM on high level timer API requirements and common features Sep 21, 2020
@emmanuelsearch
Copy link
Member

I agree with @aabadie here. From my point of view, the most crucial part of this document for our next steps is the requirements we set for our high-level timer API. Would be bizarre not to put this in the title of the RDM.

while not being to strict to rule out reasonable trade-offs e.g. between
ROM/RAM requirements and latency. An *O(n)* latency is e.g. feasible to
implement with a sorted linked list and therefore a reasonable lower bar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O(n) is an unbound function, so this justification is inconsistent. For real-time requirements ask for O(1), I suppose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For every concrete application, there is an upper bound for the number n of software timers being active at the same time. So for every concrete application, O(n) has an upper bound.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true for any finite set of numbers. So we could also go for an O(exp(n))?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O(n) is an unbound function, so this justification is inconsistent. For real-time requirements ask for O(1), I suppose.

This is intentionally written to not be O(1), because of the reasonable trade-offs as described. We know how to do O(n) by simply using linked lists and sorting on insert, and we understand the runtime implications well. We don't know how to do this in O(1) without limiting the number of allowed timers.

We do multiplexing here, so one hardware timer (or more, but a small number) needs to be used to schedule N timeouts.
So when adding one of these timeouts (using a set(interval) function), that new timeout needs to be added into the list of already queued timers. Using linked lists, this is an O(n) operation. Using an array and binary search, this is O(logn). We need to be careful what to require here, because if we break O(n*logn) for sorting n timers, we might attract a lot of attention.

It is possible to add an allowed timers limit (that is lower than available memory, e.g., 16 or 128) and enforce this.
We'd suddenly have O(1) on paper (meeting the requirements), but wouldn't have gained anything, other than that now a multitude of functions have an error case (timer_sleep() might fail with "maximum number of timers reached") that needs to be checked everywhere, which is unreasonable.

TL;DR let's not require something we don't know how to implement. This is phrased as "yadda yadda reasonable trade-offs".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The complete timer management can be done in O(log(n)), which is almost constant in the number range we consider realistically. Further optimizations can apply, if algorithms account for the deadline of the next firing timer, which can be determined in constant time.
Current preliminary HiL-tests of @pokgak show a rather poor behavior of xtimer and a somewhat unclear signature of ztimer. We will publish once ready.
In any case, it is important that timer firing remains within predictable bounds independent of the number of instantiated timers - otherwise real-time gets out of control.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log(25) ~ 1.39, exp(25) ~ 7.2 x 10**10.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log(25) ~ 1.39, exp(25) ~ 7.2 x 10**10.

Both bounded, which is exactly what a realtime system needs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really?

Copy link
Member

@bergzand bergzand Oct 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really?

Do you want to dive into the numerical analysis of the numbers you provided or can we agree here that both numbers are constants?

A realtime system does not require the system to be fast. It does however require the system to respond within the specified deadlines. This deadline can be in the order of microsecond or in the order of years, depending on the requirements.

Furthermore, a complex O(1) algorithm, taking 10 seconds performs worse compared to a simple O(n) algorithm taking 10 ms + n * 1 ms. when n is known to be bounded to 100.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a solution here, why don't we rephrase a bit and specify that it must be trivial to determine the maximum number of active timers for an application in order to ensure bounded execution time?

@kaspar030
Copy link
Contributor

@MichelRottleuthner do you want to add requirements, or are you fine with this for now?
@tcschmidt are you OK with the title now?

Copy link
Contributor

@jue89 jue89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had a deeper look into this RDM. Its been a while since the last time I did that.

All in all, this RDM covers all features we require for our use cases. A nice piece of work!

Just a minor comment. (Sorry for kicking in this late during the polishing process!)

doc/memos/rdm-draft-buschsieweke-timer-api.md Show resolved Hide resolved
@fjmolinas
Copy link
Contributor

ping @maribu!

@kaspar030
Copy link
Contributor

This PR started out to be the RDM on timers, so we could base any further work on it with confidence. But currently it is a collection of requirements and some survey on features other timer implementations provide. It does not go into specific design questions for a high-level timer implementation.

At the same time, ztimer got merged and provides many solutions to problems we have with xtimer. But migrating to ztimer as the default timer is a bit stalled for lack of an RDM approved design, which this RDM was supposed to give, but doesn't.

I think we all agree that doing comprehensive research on "how to do high-level timers right[tm]" is a lot of work. Many of us think that ztimer is a step in the right direction, and we should migrate. AFAIK noone thinks xtimer can be turned into whatever we need without substantial rework. And xtimer effectively blocks RIOT from sleeping, which should be a fix first huuuuge bug for an OS supposed to run on low-power devices.

Can we agree on most of the requirements, and concentrate on that for this PR?

We could then move forward with ztimer and check if it meets this PR's requirements, in a follow-up RDM.
Also, work on a proper "how to do high-level timers right[tm]" RDM can start.

(would it make sense to even drop the survey part here and save it for the more extensive research RDM?)

@benpicco
Copy link
Contributor

But migrating to ztimer as the default timer is a bit stalled for lack of an RDM approved design, which this RDM was supposed to give, but doesn't.

I don't think the lack of RDM stops ztimer from being used by default.
AFAIR there are still features missing from ztimer (ztimer_mutex_lock_timeout()) that prevent it from replacing xtimer entirely.

@tcschmidt
Copy link
Member

Even though I believe a rigorous and exhaustive design document would be essential for reworking the timer API, I don't see any of this here. It rather appears as an - somewhat related - informational document about "some thoughts about timers without rigorous justification". This certainly cannot serve as a foundation for redesign. The question is rather, whether this is helpful and needed or whether this can simply be discarded.
Any opinions?

@maribu
Copy link
Member Author

maribu commented Oct 25, 2020

Let me recapitulate the history of this:

  1. We have with xtimer a high level timer API that is known for years to not be insufficient for any low power scenario
  2. With ztimer an alternative was proposed that is known to be a huge step in the right direction
  3. The transition to ztimer was blocked with the argument that an RDM is needed first
  4. An RDM was started to address this
  5. The goals of the RDM were scaled down to reflect the lack of progress and engagement as well as the inability to reach a consensus on anything more ambitions than a list of basic requirements
    • The small list of features and superficial implementation details of timer APIs of competing OS could also stay (at least so far) - good to know that at least lists unopinionated facts don't spawn endless discussions
  6. Even for this unambitious RDM it seems to be difficult to reach a consensus - despite that this list of requirements is relatively unopinionated due to a lack of prioritization of the requirements most likely to end up on opposing sides of trade-offs
  7. Just when it seems that finally an agreement on this least ambitions outcome of the RDM is in reach - people start to call for burying the RDM
  8. Meanwhile: RIOT remains completely unsuitable for any low power scenarios in default configuration - despite a solid alternative that performs well in low power scenarios is already in the code base

An particularly interesting observation is that the very persons that were the first to call out for an RDM as requirement for any progress in regard to the timer API are seemingly the last persons to contribute to it.

I'm willing to keep pushing this RDM to some sort of conclusion. But please keep me out of the loop for any follow up.

@benpicco
Copy link
Contributor

  • The transition to ztimer was blocked with the argument that an RDM is needed first

Is that really the case? If ztimer works and is a drop in replacement for xtimer, nobody will care about the RDM anymore.
With the 802.15.4 radio HAL the RDM is used to discuss how things should be implemented, then implement them. It's a living document that aids development.
Having an RDM for the sake of having an RDM is pointless.

What really blocks transition to ztimer is that not all features of xtimer are covered yet.
Once it's a drop in replacement we can just replace xtimer with ztimer_xtimer_compat.

@maribu
Copy link
Member Author

maribu commented Oct 25, 2020

  • The transition to ztimer was blocked with the argument that an RDM is needed first

Is that really the case?

From the discussion in the devel mailing list I pretty much got the impression. Here are some quotes of the key arguments leading to this RDM:

Side note: I think we should move the discussion to an RDM PR for the
design document and discuss there to not get lost.

@MichelRottleuthner, 10. 12. 2019, full email for context

What do you think about an RDM PR?

@MichelRottleuthner, 11. 12. 2019, full email for context

I'd favor to not merge it before we worked on the RDM as it may show
that there are things that should be done differently.

[...]

I'll provide the collected information to the RDM.

[...]

No, I mean if "having a non functional timer for many years" happens again.
I think the way how xtimer did is job over these years is not something
we want to repeat.

So an RDM would have prevented this? I doubt it. xtimer's ISR buggyness
was known for years even without an RDM. We were just inexperienced
(or ignorant, or incompetent, your pick).

Yes I honestly think it could have helped to on that end to thoroughly
discuss the key problems with a broader audience of interested and
competent people.

@MichelRottleuthner, 13. 12. 2019, full email for context

@tcschmidt did not explicitly ask for an RDM, but for a "problem statement and design document" and wanted "a clear and falsifiable problem statement" prior to even discussing the adoption of ztimer, see full email.

To my best understanding, both @MichelRottleuthner and @tcschmidt saw completion of an RDM (or in case of @tcschmidt an RDM-like document) as requirement for adoption of ztimer. So I started this off:

in order to push this forward I just opened a PR for an RDM at
#12970. This PR is in an early state and
both feedback and help will be greatly appreciated.

@maribu, 16. 12. 2019, full email for context

I think it is fair to say that nobody's expectations towards this RDM were met. And honestly, I see no indication that a second try should work out better.

I'm not sure if what @benpicco was suggesting to rather work on improving ztimer and fill the gaps rather than endlessly stalling progress because of RDMs that don't get anywhere. But if so: strong +1

Btw: #15052 would be a huge step towards ztimer_mutex_lock_timeout().

@waehlisch
Copy link
Member

  • The transition to ztimer was blocked with the argument that an RDM is needed first

Is that really the case?

several people asked for an RDM to have a base for an informed discussion. it was not about "Having an RDM for the sake of having an RDM".

quite often sound solutions need informed discussions, to prevent trial and error. ... as someone, i think it was Kaspar, said in the timer breakout session in Helsinki: why should we be successful this time? we know from the past that code that is not enough.

@emmanuelsearch
Copy link
Member

I think it is fair to say that nobody's expectations towards this RDM were met. And honestly, I see no indication that a second try should work out better.

@maribu thanks a lot for the work you put in this.

Maybe we are falsely lead to see this glass half empty.
On the positive, half-full side:

  • we have collaboratively gathered a comprehensive list of requirements;
  • we have a solution designed and partly implemented -- but which already seems to improve the status quo;
  • some already use this solution in production.

Let's not lose sight of the above, which can be considered useful progress.
In that respect, I consider this RDM already useful -- independent of its stalling for other reasons.

@benpicco
Copy link
Contributor

benpicco commented Nov 9, 2020

As this seems to be the place to put general ideas about the high-level timer API, here is another one:

I think the high-level timer could be simplified if it configured the low-level timer in down-counting mode.
If we want to set a timeout in n ticks, right now we have to

  • get the current time
  • add n, make sure we did not overshoot already

instead, with down-counting timers we only have to

  • set timeout to n

Now this simplifies the set_timeout() function at the expense of the get_time() function - however, I think the former is executed a lot more often.

What do you think?

@tcschmidt
Copy link
Member

@benpicco Interesting perspective. I suppose it needs experimental validation, though.
Also, low-level timers should see more requirements or desires from the high-level side: A lot more interrelation with #15046 would clearly progress insights w.r.t. the real problems of timers.

@stale
Copy link

stale bot commented Jun 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale stale bot added the State: stale State: The issue / PR has no activity for >185 days label Jun 2, 2021
@SemjonWilke SemjonWilke removed their request for review June 4, 2021 11:22
@stale stale bot removed the State: stale State: The issue / PR has no activity for >185 days label Jun 4, 2021
@fjmolinas fjmolinas added this to To do in Ztimer Jun 11, 2021
@MrKevinWeiss MrKevinWeiss added this to the Release 2021.07 milestone Jun 22, 2021
@MrKevinWeiss MrKevinWeiss removed this from the Release 2021.07 milestone Jul 15, 2021
@stale
Copy link

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale stale bot added the State: stale State: The issue / PR has no activity for >185 days label Mar 2, 2022
@stale stale bot closed this Apr 17, 2022
Ztimer automation moved this from To do to Done Apr 17, 2022
@maribu maribu deleted the timer-rdm branch January 21, 2023 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: RDM Area: Discussions on RIOT Developer Memos Area: timers Area: timer subsystems Discussion: RFC The issue/PR is used as a discussion starting point about the item of the issue/PR Process: needs >1 ACK Integration Process: This PR requires more than one ACK State: stale State: The issue / PR has no activity for >185 days State: WIP State: The PR is still work-in-progress and its code is not in its final presentable form yet
Projects
Development

Successfully merging this pull request may close these issues.

None yet